The high-throughput poly(A) RNA-seq data used in this notebook are described in Neeves et al, Brain (2022). They are derived from nuclear and cytoplasmic fractions of human induced pluripotent stem cells (hiPSC; day 0), neural precursors (NPC; day 3 and day 7), āpatternedā precursor motor neurons (ventral spinal cord; pMN; day 14), post-mitotic but electrophysiologically immature motor neurons (MN; day 22), and electrophysiologically active MNs (mMNs; day 35).
The gene expression count data was obtained from Kallisto (Bray et
al., 2016) using the Gencode hg38 release Homo sapiens transcriptome.
The quantification of alternative splicing was performed using VAST-TOOLS (Tapial et
al., 2017). The data required for this practical session can be
downloaded from Zenodo
(please note this is a different version of the one in the
previous module).
load("./AS_GE_data.RData")
#Data:
# info : data-frame of information for the nuclear samples
# myE_gen : noramlised gene expression count matrix of the CTRL nuclear fraction quantile-normalised
# dat_diff_time_nuc_ctrl : data-frame with different AS analysis for the nuclear compartment
# dat_diff_time_cyto_ctrl : data-frame with different AS analysis for the cytoplasmic compartment
# tab_psi : data frame with all events in VAST-TOOLS and their PSI values in each sample
#Here I make some nice colors to facilitate the visualisation
mygroup <- factor(as.character(info$DIV),levels=c(0,3,7,14,22,35))
mycols_days <- c("#CCFF00","#33CC33","#669999","#6699FF","#3300FF","#CC33CC")
names(mycols_days) <- c(0,3,7,14,22,35)
mycols <- unlist(lapply(info$DIV,function(Z)return(mycols_days[match(as.character(Z),names(mycols_days))])))#Biological Pathway Gene Enrichment Analysis
GO_analysis <- function(genes_list){
gostres_diff <- gost(query = genes_list,
organism = "hsapiens", ordered_query = FALSE,
multi_query = FALSE, significant = TRUE, exclude_iea = FALSE,
measure_underrepresentation = FALSE, evcodes = TRUE,
user_threshold = 0.05, correction_method = "g_SCS",
domain_scope = "annotated", custom_bg = NULL,
numeric_ns = "", as_short_link = FALSE,sources=c("GO:BP", "GO:MF","KEGG"))
gostplot(gostres_diff, capped = FALSE, interactive = TRUE)#please note this is going to create an interactive plot
}
#Create a venn diagram from two gene lists
## This is an example - not meant to run
vennDiag <- function(genes_lists){
genes_comparisons <- do.call(what=cbind,args=lapply(genes_lists,function(Z)return(rownames(myE_gen)%in%Z)))
colnames(genes_comparisons)<-c("cond1","cond2")
vennDiagram(genes_comparisons,main="blasdfas")
}
#Plot PSI events over time in nuclear cyto fractions
plotEventOverTime <- function(event="HsaINT0002504"){
toi <- c("E.dPsi.d_3_0","E.dPsi.d_7_0","E.dPsi.d_14_0","E.dPsi.d_22_0","E.dPsi.d_35_0")
dat_nuc <- dat_diff_time_nuc_ctrl[match(event,dat_diff_time_nuc_ctrl$EVENT),match(toi,colnames(dat_diff_time_nuc_ctrl))]
dat_cyto<- dat_diff_time_cyto_ctrl[match(event,dat_diff_time_cyto_ctrl$EVENT),match(toi,colnames(dat_diff_time_cyto_ctrl))]
tempdat <- rbind(as.numeric(dat_nuc),as.numeric(dat_cyto))
rownames(tempdat)<-c("nuc","cyto")
colnames(tempdat)<- c("3","7","14","22","35")
tempdat[is.na(tempdat)]<-0
barplot(tempdat,beside=TRUE,las=1,ylab="dPSI",xlab="DIV",main=paste(tab_psi$GENE[match(event,tab_psi$EVENT)], "-", tab_psi$COORD[match(event,tab_psi$EVENT)]),col=c("#CC9C3D","#476DB4"))
legend("topleft",ncol=1,pch=15,col=c("#CC9C3D","#476DB4"),leg=c("nuc","cyto"),cex=0.7)
grid()
}
#To following function sort myvec in decreasing order and return their indexes:
myidx<- sort(x=c(1:2000),index.return=TRUE,decreasing=TRUE)$ixdat_diff_time_nuc_ctrl data-table) and cytoplasmic
fractions (dat_diff_time_cyto_ctrl). The schematic below
shows the four main types of events we are going to look at.
Here is a summary of the column content which you can find at on Github repo of VAST-TOOL:
First you always need to make sure you understand the format of your
data, the content of the columns and the rows. Start by checking the
dimensions of the data using dim, nrow,
ncol functions. Then if the data-count table is not too
large, you can use the View function to visualise and
explore its content. Alternatively you can print a couple
of rows. Here you need to understand the output from VAST-TOOLS.
## [1] 365711 102
## [1] "GENE" "EVENT" "COORD" "LENGTH"
## [5] "FullCO" "COMPLEX" "0_CB1D_cyto" "0_CB1D_nuc"
## [9] "0_CB1E_cyto" "0_CB1E_nuc" "0_CTRL1_cyto" "0_CTRL1_nuc"
## [13] "0_CTRL3_cyto" "0_CTRL3_nuc" "0_CTRL4_cyto" "0_CTRL4_nuc"
## [17] "0_CTRL5_cyto" "0_CTRL5_nuc" "0_GliA_cyto" "0_GliA_nuc"
## [21] "0_GliB_cyto" "14_CB1D_cyto" "14_CB1D_nuc" "14_CB1E_cyto"
## [25] "14_CB1E_nuc" "14_CTRL1_cyto" "14_CTRL1_nuc" "14_CTRL3_cyto"
## [29] "14_CTRL3_nuc" "14_CTRL4_cyto" "14_CTRL4_nuc" "14_CTRL5_cyto"
## [33] "14_CTRL5_nuc" "14_GliA_cyto" "14_GliA_nuc" "14_GliB_cyto"
## [37] "14_GliB_nuc" "22_CB1D_cyto" "22_CB1D_nuc" "22_CB1E_cyto"
## [41] "22_CB1E_nuc" "22_CTRL1_cyto" "22_CTRL1_nuc" "22_CTRL3_cyto"
## [45] "22_CTRL3_nuc" "22_CTRL4_cyto" "22_CTRL4_nuc" "22_CTRL5_cyto"
## [49] "22_CTRL5_nuc" "22_GliA_cyto" "22_GliA_nuc" "22_GliB_cyto"
## [53] "22_GliB_nuc" "35_CB1D_cyto" "35_CB1D_nuc" "35_CB1E_cyto"
## [57] "35_CB1E_nuc" "35_CTRL1_cyto" "35_CTRL1_nuc" "35_CTRL3_cyto"
## [61] "35_CTRL3_nuc" "35_CTRL4_cyto" "35_CTRL4_nuc" "35_CTRL5_cyto"
## [65] "35_CTRL5_nuc" "35_GliA_cyto" "35_GliA_nuc" "35_GliB_cyto"
## [69] "35_GliB_nuc" "3_CB1D_cyto" "3_CB1D_nuc" "3_CB1E_cyto"
## [73] "3_CB1E_nuc" "3_CTRL1_cyto" "3_CTRL1_nuc" "3_CTRL3_cyto"
## [77] "3_CTRL3_nuc" "3_CTRL4_cyto" "3_CTRL4_nuc" "3_CTRL5_cyto"
## [81] "3_CTRL5_nuc" "3_GliA_cyto" "3_GliA_nuc" "3_GliB_cyto"
## [85] "3_GliB_nuc" "7_CB1D_cyto" "7_CB1D_nuc" "7_CB1E_cyto"
## [89] "7_CB1E_nuc" "7_CTRL1_cyto" "7_CTRL1_nuc" "7_CTRL3_cyto"
## [93] "7_CTRL3_nuc" "7_CTRL4_cyto" "7_CTRL4_nuc" "7_CTRL5_cyto"
## [97] "7_CTRL5_nuc" "7_GliA_cyto" "7_GliA_nuc" "7_GliB_cyto"
## [101] "7_GliB_nuc" "COMPLEX_short"
## [1] 365711 21
## [1] "GENE" "EVENT" "COORD" "LENGTH"
## [5] "FullCO" "COMPLEX" "E.dPsi.d_3_0" "MV.dPsi.d_3_0"
## [9] "E.dPsi.d_7_0" "MV.dPsi.d_7_0" "E.dPsi.d_7_3" "MV.dPsi.d_7_3"
## [13] "E.dPsi.d_14_0" "MV.dPsi.d_14_0" "E.dPsi.d_22_0" "MV.dPsi.d_22_0"
## [17] "E.dPsi.d_35_0" "MV.dPsi.d_35_0" "E.dPsi.d_35_22" "MV.dPsi.d_35_22"
## [21] "COMPLEX_short"
## [1] 365711 21
## [1] "GENE" "EVENT" "COORD" "LENGTH"
## [5] "FullCO" "COMPLEX" "E.dPsi.d_3_0" "MV.dPsi.d_3_0"
## [9] "E.dPsi.d_7_0" "MV.dPsi.d_7_0" "E.dPsi.d_7_3" "MV.dPsi.d_7_3"
## [13] "E.dPsi.d_14_0" "MV.dPsi.d_14_0" "E.dPsi.d_22_0" "MV.dPsi.d_22_0"
## [17] "E.dPsi.d_35_0" "MV.dPsi.d_35_0" "E.dPsi.d_35_22" "MV.dPsi.d_35_22"
## [21] "COMPLEX_short"
It is generally good to get an understanding of the type of events which are differentially spliced between two conditions. Pie-charts can be useful to visualise the relative abundances of events in each categories between different types of conditions. \(\Delta\)PSI>0 indicates inclusion while \(\Delta\)PSI<0 indicates skipping between two conditions. The default threshold of a difference in PSI is 15%.
Letās start by looking at the relative number of events with a threshold difference of 20% between the iPSC stage (DIV=0) and the MN stage (DIV=35) in the nucleus:
col_events=c("#CD4599","#8C8C8C","#6ABD45","#4B5493","#FEC20F")
names(col_events) <- names(table(dat_diff_time_nuc_ctrl$COMPLEX_short))
par(mfrow=c(1,3),mar=c(1,1,2,1))
pie(table(dat_diff_time_nuc_ctrl$COMPLEX_short),col=col_events)
mtext(side=3,line=-1,text="All annotated events",cex=0.7)
pie(table(dat_diff_time_nuc_ctrl$COMPLEX_short[dat_diff_time_nuc_ctrl$E.dPsi.d_35_0>0.2]),col=col_events)
mtext(side=3,line=0,text="NUCLEUS -- DIV=35",cex=0.7)
mtext(side=3,line=-1,text="dPSI>20%",cex=0.7)
pie(table(dat_diff_time_nuc_ctrl$COMPLEX_short[dat_diff_time_nuc_ctrl$E.dPsi.d_35_0<(-0.2)]),col=col_events)
mtext(side=3,line=-1,text="dPSI<(-20%)",cex=0.7)And compare these with event distribution in cytoplasmic fractions:
par(mfrow=c(1,3),mar=c(1,1,2,1))
pie(table(dat_diff_time_cyto_ctrl$COMPLEX_short),col=col_events)
mtext(side=3,line=-1,text="All annotated events",cex=0.7)
pie(table(dat_diff_time_cyto_ctrl$COMPLEX_short[dat_diff_time_cyto_ctrl$E.dPsi.d_35_0>0.2]),col=col_events)
mtext(side=3,line=0,text="CYTOPLASM --DIV=35",cex=0.7)
mtext(side=3,line=-1,text="dPSI>20%",cex=0.7)
pie(table(dat_diff_time_cyto_ctrl$COMPLEX_short[dat_diff_time_cyto_ctrl$E.dPsi.d_35_0<(-0.2)]),col=col_events)
mtext(side=3,line=-1,text="dPSI<(-20%)",cex=0.7)What are the key observations?
Having found some differences at MN stages in relative distributions of included events between cyto and nuc (enrichment in nuclear fraction of IR), letās now test whether splicing patterns are similar in nucleus versus cytoplasm over time.
pcc_events<- do.call(what=cbind,args=lapply(unique(dat_diff_time_cyto_ctrl$COMPLEX_short),function(EV)return(unlist(lapply(c("E.dPsi.d_3_0", "E.dPsi.d_7_0" , "E.dPsi.d_14_0" , "E.dPsi.d_22_0" , "E.dPsi.d_35_0"),function(coln)return(cor(dat_diff_time_nuc_ctrl[dat_diff_time_nuc_ctrl$COMPLEX_short==EV,match(coln,colnames(dat_diff_time_nuc_ctrl))],dat_diff_time_cyto_ctrl[dat_diff_time_cyto_ctrl$COMPLEX_short==EV,match(coln,colnames(dat_diff_time_cyto_ctrl))],use = "complete")))))))
colnames(pcc_events)<- unique(dat_diff_time_cyto_ctrl$COMPLEX_short)
rownames(pcc_events)<-c("E.dPsi.d_3_0", "E.dPsi.d_7_0" , "E.dPsi.d_14_0" , "E.dPsi.d_22_0" , "E.dPsi.d_35_0")
par(mfrow=c(1,1))
barplot(pcc_events,beside=TRUE,col=unlist(lapply(colnames(pcc_events),function(Z)return(rep(col_events[match(Z,names(col_events))],5)))),las=1,ylim=c(0,1),ylab="Pearson Correlation Coefficient")Letās now visualise these results, in particular the changes between NPC and MN stages in terms of alternative exon usage and intron retention:
par(mfrow=c(2,2),mar=c(4,4,4,4))
plot(dat_diff_time_nuc_ctrl$E.dPsi.d_14_0[dat_diff_time_nuc_ctrl$COMPLEX_short=="IR"],dat_diff_time_cyto_ctrl$E.dPsi.d_14_0[dat_diff_time_cyto_ctrl$COMPLEX_short=="IR"],pch=19,col=rgb(0,0,0,0.2),cex=0.5,xlab="nucleus",ylab="cytoplasm",main="",las=1)
mtext(side=3,line=2,text="Intron Retention",cex=0.7)
mtext(side=3,line=1,text="dPSI [DIV=0-->DIV=14]",cex=0.7)
grid()
abline(0,1,col="red",lty=2)
mtext(side=3,line=0,text=paste("PCC=",round(cor(dat_diff_time_nuc_ctrl$E.dPsi.d_14_0[dat_diff_time_nuc_ctrl$COMPLEX_short=="IR"],dat_diff_time_cyto_ctrl$E.dPsi.d_14_0[dat_diff_time_cyto_ctrl$COMPLEX_short=="IR"],use = "complete"),digit=2)),cex=0.7)
plot(dat_diff_time_nuc_ctrl$E.dPsi.d_14_0[dat_diff_time_nuc_ctrl$COMPLEX_short=="EX"],dat_diff_time_cyto_ctrl$E.dPsi.d_14_0[dat_diff_time_cyto_ctrl$COMPLEX_short=="EX"],pch=19,col=rgb(0,0,0,0.2),cex=0.5,xlab="nucleus",ylab="cytoplasm",main="",las=1)
mtext(side=3,line=2,text="Alternative Exon",cex=0.7)
mtext(side=3,line=1,text="dPSI [DIV=0-->DIV=14]",cex=0.7)
mtext(side=3,line=0,text=paste("PCC=",round(cor(dat_diff_time_nuc_ctrl$E.dPsi.d_14_0[dat_diff_time_nuc_ctrl$COMPLEX_short=="EX"],dat_diff_time_cyto_ctrl$E.dPsi.d_14_0[dat_diff_time_cyto_ctrl$COMPLEX_short=="EX"],use = "complete"),digit=2)),cex=0.7)
grid()
abline(0,1,col="red",lty=2)
plot(dat_diff_time_nuc_ctrl$E.dPsi.d_35_0[dat_diff_time_nuc_ctrl$COMPLEX_short=="IR"],dat_diff_time_cyto_ctrl$E.dPsi.d_35_0[dat_diff_time_cyto_ctrl$COMPLEX_short=="IR"],pch=19,col=rgb(0,0,0,0.2),cex=0.5,xlab="nucleus",ylab="cytoplasm",main="",las=1)
mtext(side=3,line=2,text="Intron Retention",cex=0.7)
mtext(side=3,line=1,text="dPSI [DIV=0-->DIV=35]",cex=0.7)
mtext(side=3,line=0,text=paste("PCC=",round(cor(dat_diff_time_nuc_ctrl$E.dPsi.d_35_0[dat_diff_time_nuc_ctrl$COMPLEX_short=="IR"],dat_diff_time_cyto_ctrl$E.dPsi.d_35_0[dat_diff_time_cyto_ctrl$COMPLEX_short=="IR"],use = "complete"),digit=2)),cex=0.7)
grid()
abline(0,1,col="red",lty=2)
plot(dat_diff_time_nuc_ctrl$E.dPsi.d_35_0[dat_diff_time_nuc_ctrl$COMPLEX_short=="EX"],dat_diff_time_cyto_ctrl$E.dPsi.d_35_0[dat_diff_time_cyto_ctrl$COMPLEX_short=="EX"],pch=19,col=rgb(0,0,0,0.2),cex=0.5,xlab="nucleus",ylab="cytoplasm",main="",las=1)
mtext(side=3,line=2,text="Alternative Exon",cex=0.7)
mtext(side=3,line=1,text="dPSI [DIV=0-->DIV=35]",cex=0.7)
mtext(side=3,line=0,text=paste("PCC=",round(cor(dat_diff_time_nuc_ctrl$E.dPsi.d_35_0[dat_diff_time_nuc_ctrl$COMPLEX_short=="EX"],dat_diff_time_cyto_ctrl$E.dPsi.d_35_0[dat_diff_time_cyto_ctrl$COMPLEX_short=="EX"],use = "complete"),digit=2)),cex=0.7)
grid()
abline(0,1,col="red",lty=2)If interested have a look at the role of nuclear detention of intron-retaining transcripts.
While hierarchical clustering might be dominated by some changes in few genes, other methods such as singular value decomposition (SVD) analysis can be instrumental in identifying independent biological pathways underlying changes in gene expression. It is also very appropriate for time-series analysis when variation is not linearly dependent on time.
#This function extract the fraction of variance per component as derived from SVD analysis
#Input= SVD object
getFractionVariance<- function(mySVD){
return(mySVD$d*mySVD$d/sum(mySVD$d*mySVD$d))
}
#This function calculate the Shannon Entropy which indicates how well the information is spread among the principal components.
#High value indicates information evenly distributed among components
#Low value indicates most information/variance is captured by a single component
#Takes as input the fraction of variance outputted from getFractionVariance
getShannonEntropy <- function(pip)return(-1*sum(pip*log10(pip))/log10(length(pip)))
#This function plots the fraction of variance captured by first N components.
#pi=fraction of variance as obtained from getFractionVariance
#dp=Shannon Entropy as obtained from getShannonEntropy
#N=number of components to plot
ScrePlot <- function(pi,dp,N=NA){
if(is.na(N)){
barplot(pi,las=1,cex.main=0.7,cex.axis=1.0,col="black")
}
else{
barplot(pi[c(1:N)],las=1,cex.main=0.7,cex.axis=1.0,col="black")
}
mtext(side = 1, line = 2, text ="principal components", col = "black",cex=0.7, font=1)
mtext(side = 2, line = 2, text ="Fraction of explained variance", col = "black",cex=0.7, font=1)
mtext(side = 3, line = -2, text = paste("Shannon Entropy = ",round(dp,digits=3)), col = "black",cex=0.7, font=1)
}For this analysis we will focus on the nuclear samples and use the different types of data collected so far i.e.Ā gene expression, PUD, IR and AltEx. A similar analysis can be done on the cytoplasmic fraction.
ge <- myE_gen
EX_nuc <- as.matrix(tab_psi[which(tab_psi$COMPLEX_short=="EX"),match(colnames(myE_gen),colnames(tab_psi))])
IR_nuc <- as.matrix(tab_psi[which(tab_psi$COMPLEX_short=="IR"),match(colnames(myE_gen),colnames(tab_psi))])
rownames(EX_nuc) <- as.character(tab_psi$EVENT)[which(tab_psi$COMPLEX_short=="EX")]
rownames(IR_nuc) <- as.character(tab_psi$EVENT)[which(tab_psi$COMPLEX_short=="IR")]
#You need to remove NA's and NaN for the unsupervised analysis
tosel_ex <- which(apply(is.na(EX_nuc),1,sum)==0&apply(is.nan(EX_nuc),1,sum)==0)
tosel_ir <- which(apply(is.na(IR_nuc),1,sum)==0&apply(is.nan(IR_nuc),1,sum)==0)
EX_nuc <- EX_nuc[tosel_ex,]
IR_nuc <- IR_nuc[tosel_ir,]It is common in SVD analysis to remove the first component which captures noise == most variations from from systematic changes in basal gene expression between genes.
#SVD on the Gene Expression
dat1 <- ge
SVD_eset <- svd(dat1)
pi_1 <- getFractionVariance(mySVD=SVD_eset)
dp_1 <- getShannonEntropy(pi_1)
datm <- dat1-SVD_eset$d[1]*(SVD_eset$u[,1]%*%t(SVD_eset$v[,1]))
SVD_eset_GE <- svd(datm)
pi_2 <- getFractionVariance(mySVD=SVD_eset_GE)
dp_2 <- getShannonEntropy(pi_2)
par(mfrow=c(2,4))
ScrePlot(pi_1,dp_1,N=NA)
mtext(side=3,line=0,text="prior removing first component")
#Effect on the sample distribution
multidensity(dat1[,c(1,10,15,20)],main="prior filtering",las=1)
boxplot(dat1,outline=FALSE)
boxplot(t(dat1[c(1,4,10,44,69,1000,2000,4000),]),outline=FALSE,las=2,ylab="log2(count)")
ScrePlot(pi_2,dp_2,N=NA)
mtext(side=3,line=0,text="after removing first component")
multidensity(datm[,c(1,10,15,20)],main="after filtering",las=1)
boxplot(datm,outline=FALSE)
#Effect on the genes distribution
boxplot(t(datm[c(1,4,10,44,69,1000,2000,4000),]),outline=FALSE,las=2,ylab="log2(count)")We start by performing SVD on individual matrices and compare how information is captured in the different matrices.
layout(matrix(c(1:6),ncol=3,nrow=2,byrow=FALSE))
#1. SVD on the Gene Expression
dat1 <- ge
SVD_eset <- svd(dat1)
ScrePlot(getFractionVariance(SVD_eset),getShannonEntropy(getFractionVariance(SVD_eset)))
mtext(side=3,line=0,text="Gene Expression")
datm <- dat1-SVD_eset$d[1]*(SVD_eset$u[,1]%*%t(SVD_eset$v[,1]))
SVD_eset_GE <- svd(datm)
ScrePlot(getFractionVariance(SVD_eset_GE),getShannonEntropy(getFractionVariance(SVD_eset_GE)))
mtext(side=3,line=0,text="Gene Expression")
#2. SVD on the AltEx
dat1 <- EX_nuc
SVD_eset <- svd(dat1)
ScrePlot(getFractionVariance(SVD_eset),getShannonEntropy(getFractionVariance(SVD_eset)))
mtext(side=3,line=0,text="AltEx")
mtext(side=3,line=1,text="Prior removing first component")
datm <- dat1-SVD_eset$d[1]*(SVD_eset$u[,1]%*%t(SVD_eset$v[,1]))
SVD_eset_Ex <- svd(datm)
ScrePlot(getFractionVariance(SVD_eset_Ex),getShannonEntropy(getFractionVariance(SVD_eset_Ex)))
mtext(side=3,line=1,text="After removing first component")
mtext(side=3,line=0,text="AltEx")
#3. SVD on the IR
dat1 <- IR_nuc
SVD_eset <- svd(dat1)
ScrePlot(getFractionVariance(SVD_eset),getShannonEntropy(getFractionVariance(SVD_eset)))
mtext(side=3,line=0,text="IR")
datm <- dat1-SVD_eset$d[1]*(SVD_eset$u[,1]%*%t(SVD_eset$v[,1]))
SVD_eset_IR <- svd(datm)
ScrePlot(getFractionVariance(SVD_eset_IR),getShannonEntropy(getFractionVariance(SVD_eset_IR)))
mtext(side=3,line=0,text="IR")Letās now compare their first PCs to test who the time is captured by the different matrices in the different SVD.
#PCA plots
par(mfrow=c(3,3),mar=c(4,4,2,2))
plot(SVD_eset_GE$v[,1], SVD_eset_GE$v[,2],col=mycols,xlab="PC1",ylab="PC2",main="Gene Expression",pch=19,cex=1.5,las=1)
plot(SVD_eset_Ex$v[,1], SVD_eset_Ex$v[,2],col=mycols,xlab="PC1",ylab="PC2",main="AltEx",pch=19,cex=1.5,las=1)
plot(SVD_eset_IR$v[,1], SVD_eset_IR$v[,2],col=mycols,xlab="PC1",ylab="PC2",main="IR",pch=19,cex=1.5,las=1)
plot(SVD_eset_GE$v[,1], SVD_eset_GE$v[,3],col=mycols,xlab="PC1",ylab="PC3",main="Gene Expression",pch=19,cex=1.5,las=1)
plot(SVD_eset_Ex$v[,1], SVD_eset_Ex$v[,3],col=mycols,xlab="PC1",ylab="PC3",main="AltEx",pch=19,cex=1.5,las=1)
plot(SVD_eset_IR$v[,1], SVD_eset_IR$v[,3],col=mycols,xlab="PC1",ylab="PC3",main="IR",pch=19,cex=1.5,las=1)
plot(SVD_eset_GE$v[,2], SVD_eset_GE$v[,3],col=mycols,xlab="PC2",ylab="PC3",main="Gene Expression",pch=19,cex=1.5,las=1)
plot(SVD_eset_Ex$v[,2], SVD_eset_Ex$v[,3],col=mycols,xlab="PC2",ylab="PC3",main="AltEx",pch=19,cex=1.5,las=1)
plot(SVD_eset_IR$v[,2], SVD_eset_IR$v[,3],col=mycols,xlab="PC2",ylab="PC3",main="IR",pch=19,cex=1.5,las=1)Do you see any differences in the clustering of the samples using the different count matrices?
Letās now look into the dynamic over time of the different component:
myMean <- list( t(apply(SVD_eset_GE$v,2,function(x)return(tapply(x,INDEX=mygroup,FUN=mean)))),
t(apply(SVD_eset_Ex$v,2,function(x)return(tapply(x,INDEX=mygroup,FUN=mean)))),
t(apply(SVD_eset_IR$v,2,function(x)return(tapply(x,INDEX=mygroup,FUN=mean)))) )
mySD <- list(t(apply(SVD_eset_GE$v,2,function(x)return(tapply(x,INDEX=mygroup,FUN=sd)))),
t(apply(SVD_eset_Ex$v,2,function(x)return(tapply(x,INDEX=mygroup,FUN=sd)))),
t(apply(SVD_eset_IR$v,2,function(x)return(tapply(x,INDEX=mygroup,FUN=sd)))) )
#Plot component over time
days <- c(0,3,7,14,21,35)
cats <- c("Gene Expression","AltEx","IR","3UTR")
CEX<- 0.7
par(mfrow=c(3,3),mar=c(5,5,2,0),oma=c(2,2,2,2))
for(j in c(1:3)){
for(i in c(1:3)){
MIN=min(0,min(myMean[[j]][i,]-mySD[[j]][i,]))
MAX=max(myMean[[j]][i,]+mySD[[j]][i,])
plot(days,myMean[[j]][i,],pch=19,type="b",lty=2,ylim=c(MIN,MAX),las=1,frame="F",xlab="time [days]",cex=1.0,cex.axis=CEX,cex.lab=CEX,ylab="")
mtext(side=2,line=3,text=paste("PC",i,sep=""),cex=CEX)
mtext(side=3,line=0,text=cats[j],cex=CEX)
grid()
abline(h=0,col="red",lty=2)
}
}Your homework will be graded according to three criteria: 1) Correctness of your result; 2) Clarity of the visual output; 3) Description of your results demonstrating your ability to discuss your result in their biological context.
Use hierarchical clustering analysis to test whether the time component is differentially captured by changes in gene expression, AltEx, and intron retention the nuclear fractions.
We will use clustering method based on the Manhattan distance and Ward D algorithm:
CEX=1.0
hcl_raw <- hclust(dist(t(ge),method="man"), method = "ward.D", members = NULL)
hcl_ex_nuc <- hclust(dist(t(EX_nuc),method="man"), method = "ward.D", members = NULL)
hcl_ir_nuc <- hclust(dist(t(IR_nuc),method="man"), method = "ward.D", members = NULL)
mytime <- factor(as.character(info$DIV),levels=c(0,3,7,14,22,35))
mycols_days <- c("#CCFF00","#33CC33","#669999","#6699FF","#3300FF","#CC33CC")
names(mycols_days) <- c(0,3,7,14,22,35)
mycols <- unlist(lapply(info$DIV,function(Z)return(mycols_days[match(as.character(Z),names(mycols_days))])))
par(mfrow=c(1,3), mar=c(1,0,3,1))
plot(ape::as.phylo(hcl_raw),tip.color=mycols,cex=CEX,label.offset = 0.001,no.margin = TRUE,use.edge.length=TRUE,direction="rightwards",plot=TRUE,font=1,main="Gene expression")
plot(ape::as.phylo(hcl_ex_nuc),tip.color=mycols,cex=CEX,label.offset = 0.001,no.margin = TRUE,use.edge.length=TRUE,direction="rightwards",plot=TRUE,font=1,main="AltEx")
plot(ape::as.phylo(hcl_ir_nuc),tip.color=mycols,cex=CEX,label.offset = 0.001,no.margin = TRUE,use.edge.length=TRUE,direction="rightwards",plot=TRUE,font=1,main="Intron retention")
From the plots above, we can conclude that time component is
differentially captured by changes in gene expression, since control
samples that belong to the same time points are clustered together.
column_names = colnames(IR_nuc)
row_names = rownames(IR_nuc)
# We perform SVD analyisis on IR_nuc table
dat1 = IR_nuc
SVD_eset = svd(dat1)
# It is common in SVD analysis to remove the first component that captures noise
datm = dat1-SVD_eset$d[1]*(SVD_eset$u[,1]%*%t(SVD_eset$v[,1]))
SVD_eset_IR = svd(datm)
# We use left-singular matrix
u = SVD_eset_IR$u
# We extract first 5 PC values
PC1 = u[,1]
PC2 = u[,2]
PC3 = u[,3]
PC4 = u[,4]
PC5 = u[,5]
# We need to sort them
PC1_sorted = sort(x=PC1,index.return=TRUE,decreasing=TRUE)$ix
PC2_sorted = sort(x=PC2,index.return=TRUE,decreasing=TRUE)$ix
PC3_sorted = sort(x=PC3,index.return=TRUE,decreasing=TRUE)$ix
PC4_sorted = sort(x=PC4,index.return=TRUE,decreasing=TRUE)$ix
PC5_sorted = sort(x=PC5,index.return=TRUE,decreasing=TRUE)$ix
# We extract top 500 positively and negatively contributing genes for each PC
# PC1
top500_p_pc1 = PC1_sorted[1:500]
top500_n_pc1 = tail(PC1_sorted, 500)
top500_p_events_pc1 = IR_nuc[top500_p_pc1, ]
top500_n_events_pc1 = IR_nuc[top500_n_pc1, ]
# PC2
top500_p_pc2 = PC2_sorted[1:500]
top500_n_pc2 = tail(PC2_sorted, 500)
top500_p_events_pc2 = IR_nuc[top500_p_pc2, ]
top500_n_events_pc2 = IR_nuc[top500_n_pc2, ]
# PC3
top500_p_pc3 = PC3_sorted[1:500]
top500_n_pc3 = tail(PC3_sorted, 500)
top500_p_events_pc3 = IR_nuc[top500_p_pc3, ]
top500_n_events_pc3 = IR_nuc[top500_n_pc3, ]
# PC4
top500_p_pc4 = PC4_sorted[1:500]
top500_n_pc4 = tail(PC4_sorted, 500)
top500_p_events_pc4 = IR_nuc[top500_p_pc4, ]
top500_n_events_pc4 = IR_nuc[top500_n_pc4, ]
# PC5
top500_p_pc5 = PC5_sorted[1:500]
top500_n_pc5 = tail(PC5_sorted, 500)
top500_p_events_pc5 = IR_nuc[top500_p_pc5, ]
top500_n_events_pc5 = IR_nuc[top500_n_pc5, ]
par(mfrow=c(2,1))
boxplot(top500_p_events_pc1, main="PSI values over time for genes contributing positively to PC1", ylab="PSI")
boxplot(top500_n_events_pc1, main="PSI values over time for genes contributing negatively to PC1", ylab="PSI")par(mfrow=c(2,1))
boxplot(top500_p_events_pc2, main="PSI values over time for genes contributing positively to PC2", ylab="PSI")
boxplot(top500_n_events_pc2, main="PSI values over time for genes contributing negatively to PC2", ylab="PSI")par(mfrow=c(2,1))
boxplot(top500_p_events_pc3, main="PSI values over time for genes contributing positively to PC3", ylab="PSI")
boxplot(top500_n_events_pc3, main="PSI values over time for genes contributing negatively to PC3", ylab="PSI")par(mfrow=c(2,1))
boxplot(top500_p_events_pc4, main="PSI values over time for genes contributing positively to PC4", ylab="PSI")
boxplot(top500_n_events_pc4, main="PSI values over time for genes contributing negatively to PC4", ylab="PSI")par(mfrow=c(2,1))
boxplot(top500_p_events_pc5, main="PSI values over time for genes contributing positively to PC5", ylab="PSI")
boxplot(top500_n_events_pc5, main="PSI values over time for genes contributing negatively to PC5", ylab="PSI")
Hint: U left-singular matrix provides with loadings of the
genes/transcripts in individual components.
Perform BP enrichment analysis on relevant PCs as identified in previous point.
# We need to extract the events that correspond to the top 500 in each PCi
# PC1
events_pc1_p = row_names[top500_p_pc1]
events_pc1_n = row_names[top500_n_pc1]
genes_pc1_p = dat_diff_time_nuc_ctrl[dat_diff_time_nuc_ctrl$EVENT %in% events_pc1_p, ]$GENE
genes_pc1_n = dat_diff_time_nuc_ctrl[dat_diff_time_nuc_ctrl$EVENT %in% events_pc1_n, ]$GENE
par(mfrow=c(1,1))
# Positively contributing genes
GO_analysis(genes_pc1_p)From the 500 most positively contributing genes to the PC1 in terms of BP enrichment we have the ones related to the pathways of transport and cellular localization. For negatively contributing genes (PC1), BP enriched pathways are organelle localization, cytoskeleton organization, cell projection organization, plasma membrane bounded cell projection organization.
# PC2
events_pc2_p = row_names[top500_p_pc2]
events_pc2_n = row_names[top500_n_pc2]
genes_pc2_p = dat_diff_time_nuc_ctrl[dat_diff_time_nuc_ctrl$EVENT %in% events_pc2_p, ]$GENE
genes_pc2_n = dat_diff_time_nuc_ctrl[dat_diff_time_nuc_ctrl$EVENT %in% events_pc2_n, ]$GENE
# Positively contributing genes
par(mfrow=c(1,1))
GO_analysis(genes_pc2_p)From the 500 most positively contributing genes to the PC2 in terms of BP enrichment we have the ones related to the pathways of organelle organization and transport. For negatively contributing genes (PC2), BP enriched pathways are organelle localization, regulation of nitrogen compound metabolic process, regulation of primary metabolic processes.
# PC3
events_pc3_p = row_names[top500_p_pc3]
events_pc3_n = row_names[top500_n_pc3]
genes_pc3_p = dat_diff_time_nuc_ctrl[dat_diff_time_nuc_ctrl$EVENT %in% events_pc3_p, ]$GENE
genes_pc3_n = dat_diff_time_nuc_ctrl[dat_diff_time_nuc_ctrl$EVENT %in% events_pc3_n, ]$GENE
# Positively contributing genes
par(mfrow=c(1,1))
GO_analysis(genes_pc3_p)From the 500 most positively contributing genes to the PC3 in terms of BP enrichment we have the ones related to the pathways of nervous system development, multicellular organism development and system development. For negatively contributing genes (PC3), BP enriched pathways are vesicle-mediated transport, actin cytoskeleton organization and actin filament-based process.
# PC4
events_pc4_p = row_names[top500_p_pc4]
events_pc4_n = row_names[top500_n_pc4]
genes_pc4_p = dat_diff_time_nuc_ctrl[dat_diff_time_nuc_ctrl$EVENT %in% events_pc4_p, ]$GENE
genes_pc4_n = dat_diff_time_nuc_ctrl[dat_diff_time_nuc_ctrl$EVENT %in% events_pc4_n, ]$GENE
# Positively contributing genes
par(mfrow=c(1,1))
GO_analysis(genes_pc4_p)From the 500 most positively contributing genes to the PC4 in terms of BP enrichment we have the ones related to the pathways of regulation of cellular component organization and biological regulation. For negatively contributing genes (PC4), BP enriched pathways are cell morphogenesis and system development.
# PC5
events_pc5_p = row_names[top500_p_pc5]
events_pc5_n = row_names[top500_n_pc5]
genes_pc5_p = dat_diff_time_nuc_ctrl[dat_diff_time_nuc_ctrl$EVENT %in% events_pc5_p, ]$GENE
genes_pc5_n = dat_diff_time_nuc_ctrl[dat_diff_time_nuc_ctrl$EVENT %in% events_pc5_n, ]$GENE
# Positively contributing genes
par(mfrow=c(1,1))
GO_analysis(genes_pc5_p)From the 500 most positively contributing genes to the PC5 in terms of BP enrichment we have the ones related to the pathways of cellular response to stress and biological regulation. For negatively contributing genes (PC5), BP enriched pathways are regulation of primary metabolic process and regulation of nitrogen compound metabolic process.